TIMSS 1999 Benchmarking Science Report Appendix A

Appendix A Contents
	•History
	•Participants in TIMSS Benchmarking
	•Developing the TIMSS 1999 Science Test
	•TIMSS Test Design
	•Background Questionnaires
	•Translation and Verification
	•Population Definition and Sampling
	•Data Collection
	•Scoring the Free-Response Items
	•Test Reliability
	•Data Processing
	•IRT Scaling and Data Analysis
	•Estimating Sampling Error
	•Making Multiple Comparisons
	•Setting International Benchmarks of Student Achievement
	•Science Curriculum Questionnaire

History

TIMSS 1999 represents the continuation of a long series of studies conducted by the International Association for the Evaluation of Educational Achievement (IEA). Since its inception in 1959, the IEA has conducted more than 15 studies of cross-national achievement in the curricular areas of mathematics, science, language, civics, and reading. The Third International Mathematics and Science Study (TIMSS), conducted in 1994-1995, was the largest and most complex IEA study, and included both mathematics and science at third and fourth grades, seventh and eighth grades, and the final year of secondary school. In 1999, TIMSS again assessed eighth-grade students in both mathematics and science to measure trends in student achievement since 1995. TIMSS 1999 was also known as TIMSS-Repeat, or TIMSS-R.(1)

To provide U.S. states and school districts with an opportunity to benchmark the performance of their students against that of students in the high-performing TIMSS countries, the International Study Center at Boston College, with the support of the National Center for Education Statistics and the National Science Foundation, established the TIMSS 1999 Benchmarking Study. Through this project, the TIMSS mathematics and science achievement tests and questionnaires were administered to representative samples of students in participating states and school districts in the spring of 1999, at the same time the tests and questionnaires were administered in the TIMSS countries. Participation in TIMSS Benchmarking was intended to help states and districts understand their comparative educational standing, assess the rigor and effectiveness of their own mathematics and science programs in an international context, and improve the teaching and learning of mathematics and science.

Participants in TIMSS Benchmarking

Thirteen states took the opportunity to participate in the Benchmarking Study. Eight public school districts and six consortia also participated, for a total of fourteen districts and consortia. They are listed in Exhibit 1 of the Introduction, together with the 38 countries that took part in TIMSS 1999.

Developing the TIMSS 1999 Science Test

The TIMSS curriculum framework underlying the science tests was developed for TIMSS in 1995 by groups of science educators with input from the TIMSS National Research Coordinators (NRCs). As shown in Exhibit A.1, the science curriculum framework contains three dimensions or aspects. The content aspect represents the subject matter content of school science. The performance expectations aspect describes, in a non-hierarchical way, the many kinds of performances or behaviors that might be expected of students in school science. The perspectives aspect focuses on the development of students’ attitudes, interest, and motivation in science. Because the frameworks were developed to include content, performance expectations, and perspectives for the entire span of curricula from the beginning of schooling through the completion of secondary school, some aspects may not be reflected in the eighth-grade TIMSS assessment.(2) Working within the framework, science test specifications for TIMSS in 1995 were developed that included items representing a wide range of science topics and eliciting a range of skills from the students. The 1995 tests were developed through an international consensus involving input from experts in science and measurement specialists, ensuring they reflected current thinking and priorities in the sciences.

About one-third of the items in the 1995 assessment were kept secure to measure trends over time; the remaining items were released for public use. An essential part of the development of the 1999 assessment, therefore, was to replace the released items with items of similar content, format, and difficulty. With the assistance of the Science and Mathematics Item Replacement Committee, a group of internationally prominent mathematics and science educators nominated by participating countries to advise on subject-matter issues in the assessment, over 300 mathematics and science items were developed as potential replacements. After an extensive process of review and field testing, 98 items were selected for use as replacements in the 1999 science assessment.

Exhibit A.2 presents the six content areas included in the 1999 science test and the numbers of items and score points in each area. Distributions are also included for the five performance categories derived from the performance expectations aspect of the curriculum framework. About one-fourth of the items were in the free-response format, requiring students to generate and write their own answers. Designed to take about one-third of students’ test time, some free-response questions asked for short answers while others required extended responses with students showing their work or providing explanations for their answers. The remaining questions used a multiple-choice format. In scoring the tests, correct answers to most questions were worth one point. Consistent with the approach of allotting students longer response time for the constructed-response questions than for multiple-choice questions, however, responses to some of these questions (particularly those requiring extended responses) were evaluated for partial credit, with a fully correct answer being awarded two points (see later section on scoring). The total number of score points available for analysis thus somewhat exceeds the number of items.

Every effort was made to help ensure that the tests represented the curricula of the participating countries and that the items exhibited no bias towards or against particular countries. The final forms of the tests were endorsed by the NRCs of the participating countries.(3)

TIMSS Test Design

Not all of the students in the TIMSS assessment responded to all of the science items. To ensure broad subject-matter coverage without overburdening individual students, TIMSS used a rotated design that included both the mathematics and science items. Thus, the same students participated in both the mathematics and science testing. As in 1995, the 1999 assessment consisted of eight booklets, each requiring 90 minutes of response time. Each participating student was assigned one booklet only. In accordance with the design, the mathematics and science items were assembled into 26 clusters (labeled A through Z). The secure trend items were in clusters A through H, and the items replacing the released 1995 items were in clusters I through Z. Eight of the clusters were designed to take 12 minutes to complete; 10 of the clusters, 22 minutes; and 8 clusters, 10 minutes. In all, the design provided 396 testing minutes, 198 for mathematics and 198 for science. Cluster A was a core cluster assigned to all booklets. The remaining clusters were assigned to the booklets in accordance with the rotated design so that representative samples of students responded to each cluster.(4)
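The cluster arithmetic in the test design can be checked with a short sketch. The cluster counts and durations are taken from the paragraph above; the variable names are illustrative only.

```python
# (number of clusters, minutes per cluster), as stated in the test design
cluster_groups = [(8, 12), (10, 22), (8, 10)]

# Total number of clusters (labeled A through Z in the design)
total_clusters = sum(count for count, _ in cluster_groups)

# Total testing minutes provided by the design
total_minutes = sum(count * minutes for count, minutes in cluster_groups)

# The design splits the time evenly between mathematics and science
minutes_per_subject = total_minutes // 2

print(total_clusters, total_minutes, minutes_per_subject)
```

The totals agree with the 26 clusters and 396 testing minutes (198 per subject) quoted in the text.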

Background Questionnaires

TIMSS in 1999 administered a broad array of questionnaires to collect data on the educational context for student achievement and to measure trends since 1995. National Research Coordinators, with the assistance of their curriculum experts, provided detailed information on the organization, emphases, and content coverage of the mathematics and science curriculum. The students who were tested answered questions pertaining to their attitudes towards mathematics and science, their academic self-concept, classroom activities, home background, and out-of-school activities. The mathematics and science teachers of sampled students responded to questions about teaching emphasis on the topics in the curriculum frameworks, instructional practices, professional training and education, and their views on mathematics and science. The heads of schools responded to questions about school staffing and resources, mathematics and science course offerings, and teacher support.

Translation and Verification

The TIMSS instruments were prepared in English and translated into 33 languages, with 10 of the 38 countries collecting data in two languages. In addition, it sometimes was necessary to modify the international versions for cultural reasons, even in the nine countries that tested in English. This process represented an enormous effort for the national centers, with many checks along the way. The translation effort included (1) developing explicit guidelines for translation and cultural adaptation; (2) translation of the instruments by the national centers in accordance with the guidelines, using two or more independent translations; (3) consultation with subject-matter experts on cultural adaptations to ensure that the meaning and difficulty of items did not change; (4) verification of translation quality by professional translators from an independent translation company; (5) corrections by the national centers in accordance with the suggestions made; (6) verification by the International Study Center that corrections were made; and (7) a series of statistical checks after the testing to detect items that did not perform comparably across countries.(5)

Population Definition and Sampling

TIMSS in 1995 had as its target population students enrolled in the two adjacent grades that contained the largest proportion of 13-year-old students at the time of testing, which were seventh- and eighth-grade students in most countries. TIMSS in 1999 used the same definition to identify the target grades, but assessed students in the upper of the two grades only, which was the eighth grade in most countries, including the United States.(6) The eighth grade was the target population for all of the Benchmarking participants.

The selection of valid and efficient samples was essential to the success of TIMSS and of the Benchmarking Study. For TIMSS internationally, the NRCs received training in how to select the school and student samples and in the use of the sampling software, and worked in close consultation with Statistics Canada, the TIMSS sampling consultants, on all phases of sampling. This included Westat, the sampling and data collection coordinator for TIMSS in the United States. As well as conducting the sampling and data collection for the U.S. national TIMSS sample, Westat was also responsible for sampling and data collection in each of the Benchmarking states, districts, and consortia.

To document the quality of the school and student samples in each of the TIMSS countries, staff from Statistics Canada and the International Study Center worked with the TIMSS sampling referee (Keith Rust, Westat) to review sampling plans, sampling frames, and sampling implementation. Particular attention was paid to coverage of the target population and to participation by the sampled schools and students. The data from the few countries that did not fully meet all of the sampling guidelines are annotated in the TIMSS international reports, and are also annotated in this report. The TIMSS samples for the Benchmarking participants were also carefully reviewed in light of the TIMSS sampling guidelines, and the results annotated where appropriate. Since Westat was the sampling contractor for the Benchmarking project, the role of sampling referee for the Benchmarking review was filled by Pierre Foy, of Statistics Canada.

Although all countries and Benchmarking participants were expected to draw samples representative of the entire internationally desired population (all students in the upper of the two adjacent grades with the greatest proportion of 13-year-olds), the few countries where this was not possible were permitted to define a national desired population that excluded part of the internationally desired population. Exhibit A.3 shows any differences in coverage between the international and national desired populations. Almost all TIMSS countries achieved 100 percent coverage (36 out of 38), with Lithuania and Latvia the exceptions. Consequently, the results for Lithuania are annotated, and because coverage fell below 65 percent for Latvia, the Latvian results are labeled “Latvia (LSS),” for Latvian-Speaking Schools. Additionally, because of scheduling difficulties, Lithuania was unable to test its eighth-grade students in May 1999 as planned. Instead, the students were tested in September 1999, when they had moved into the ninth grade. The results for Lithuania are annotated to reflect this as well. Exhibit A.3 also shows that the sampling plans for the Benchmarking participants all incorporated 100 percent coverage of the desired population. Four of the 13 states (Idaho, Indiana, Michigan, and Pennsylvania) as well as the Southwest Pennsylvania Math and Science Collaborative included private schools as well as public schools.

In operationalizing their desired eighth-grade population, countries and Benchmarking participants could define a population to be sampled that excluded a small percentage (less than 10 percent) of certain kinds of schools or students that would be very difficult or resource-intensive to test (e.g., schools for students with special needs or schools that were very small or located in extremely rural areas). Exhibit A.3 also shows that the degree of such exclusions was small. Among countries, only Israel reached the 10 percent limit, and among Benchmarking participants, only Guilford County and Montgomery County did so. All three are annotated as such in the achievement chapters of this report.

Within countries, TIMSS used a two-stage sample design, in which the first stage involved selecting about 150 public and private schools in each country. Within each school, countries were to use random procedures to select one mathematics class at the eighth grade. All of the students in that class were to participate in the TIMSS testing. This approach was designed to yield a representative sample of about 3,750 students per country. Typically, between 450 and 3,750 students responded to each achievement item in each country, depending on the booklets in which the items appeared.

States participating in the Benchmarking study were required to sample at least 50 schools and approximately 2,000 eighth-grade students. School districts and consortia were required to sample at least 25 schools and at least 1,000 students. Where there were fewer than 25 schools in a district or consortium, all schools were to be included, and the within-school sample increased to yield the total of 1,000 students.

Exhibits A.4 and A.5 present achieved sample sizes for schools and students, respectively, for the TIMSS countries and for the Benchmarking participants. Where a district or consortium was part of a state that also participated, the state sample was augmented by the district or consortium sample, properly weighted in accordance with its size. Schools in a state that were sampled as part of the U.S. national TIMSS sample were also used to augment the state sample. For example, the Illinois sample consists of 90 schools, 41 from the state Benchmarking sample (including five schools from the national TIMSS sample), 27 from the Chicago Public Schools, 17 from the First in the World Consortium, and five from the Naperville School District.

Exhibit A.6 shows the participation rates for schools, students, and overall, both with and without the use of replacement schools, for TIMSS countries and Benchmarking participants. All of the countries met the guideline for sampling participation – 85 percent of both the schools and students, or a combined rate (the product of school and student participation) of 75 percent – although Belgium (Flemish), England, Hong Kong, and the Netherlands did so only after including replacement schools, and are annotated accordingly in the achievement chapters.

With the exception of Pennsylvania and Texas, all the Benchmarking participants met the sampling guidelines, although Indiana did so only after including replacement schools. Indiana is annotated to reflect this in the achievement chapters, and Pennsylvania and Texas are italicized in all exhibits in this report.
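The participation guideline described above can be sketched as a simple check. This is a minimal illustration that reads the thresholds as "at least 85 percent" and "at least 75 percent"; the function name and the example rates are hypothetical and are not taken from Exhibit A.6.

```python
def meets_participation_guideline(school_rate, student_rate):
    """Check the stated guideline for rates given in percent.

    The guideline is met when both school and student participation
    reach 85 percent, or when the combined rate (the product of the
    two rates) reaches 75 percent.
    """
    combined_rate = school_rate * student_rate / 100.0
    both_high = school_rate >= 85 and student_rate >= 85
    return both_high or combined_rate >= 75


# Hypothetical rates, for illustration only
print(meets_participation_guideline(90, 92))  # both rates at least 85
print(meets_participation_guideline(80, 95))  # combined rate is 76 percent
print(meets_participation_guideline(70, 95))  # combined rate is 66.5 percent
```

A jurisdiction that reaches the threshold only when replacement schools are counted would be evaluated twice, once with each set of rates, which is the distinction the annotations in the text record.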

Data Collection

Each participating country was responsible for carrying out all aspects of the data collection, using standardized procedures developed for the study. Training manuals were created for school coordinators and test administrators that explained procedures for receipt and distribution of materials as well as for the activities related to the testing sessions. These manuals covered procedures for test security, standardized scripts to regulate directions and timing, rules for answering students’ questions, and steps to ensure that identification on the test booklets and questionnaires corresponded to the information on the forms used to track students. As the data collection contractor for the U.S. national TIMSS, Westat was fully acquainted with the TIMSS procedures, and applied them in each of the Benchmarking jurisdictions in the same way as in the national data collection.

Each country was responsible for conducting quality control procedures and describing this effort in the NRC’s report documenting procedures used in the study. In addition, the International Study Center considered it essential to monitor compliance with standardized procedures through an international program of quality control site visits. NRCs were asked to nominate one or more persons unconnected with their national center, such as retired school teachers, to serve as quality control monitors for their countries. The International Study Center developed manuals for the monitors and briefed them in two-day training sessions about TIMSS, the responsibilities of the national centers in conducting the study, and their own roles and responsibilities. In all, 71 international quality control monitors participated in this training.

The international quality control monitors interviewed the NRCs about data collection plans and procedures. They also visited a sample of 15 schools where they observed testing sessions and interviewed school coordinators.(7) Quality control monitors interviewed school coordinators in all 38 countries, and observed a total of 550 testing sessions. The results of the interviews conducted by the international quality control monitors indicated that, in general, NRCs had prepared well for data collection and, despite the heavy demands of the schedule and shortages of resources, were able to conduct the data collection efficiently and professionally. Similarly, the TIMSS tests appeared to have been administered in compliance with international procedures, including the activities before the testing session, those during testing, and the school-level activities related to receiving, distributing, and returning material from the national centers.

As a parallel quality control effort for the Benchmarking project, the International Study Center recruited and trained a team of 18 quality control observers, and sent them to observe the data collection activities of the Westat test administrators in a sample of about 10 percent of the schools in the study (98 schools in all).(8) In line with the experience internationally, the observers reported that the data collection was conducted successfully according to the prescribed procedures, and that no serious problems were encountered.

Scoring the Free-Response Items

Because about one-third of the written test time was devoted to free-response items, TIMSS needed to develop procedures for reliably evaluating student responses within and across countries. Scoring used two-digit codes with rubrics specific to each item. The first digit designates the correctness level of the response. The second digit, combined with the first, represents a diagnostic code identifying specific types of approaches, strategies, or common errors and misconceptions. Although not used in this report, analyses of responses based on the second digit should provide insight into ways to help students better understand science concepts and problem-solving approaches.

To ensure reliable scoring procedures based on the TIMSS rubrics, the International Study Center prepared detailed guides containing the rubrics and explanations of how to implement them, together with example student responses for the various rubric categories. These guides, along with training packets containing extensive examples of student responses for practice in applying the rubrics, were used as a basis for intensive training in scoring the free-response items. The training sessions were designed to help representatives of national centers who would then be responsible for training personnel in their countries to apply the two-digit codes reliably. In the United States, the scoring was conducted by National Computer Systems (NCS) under contract to Westat. To ensure that student responses from the Benchmarking participants were scored in the same way as those from the U.S. national sample, NCS had both sets of data scored at the same time and by the same scoring staff.

To gather and document empirical information about the within-country agreement among scorers, TIMSS arranged to have systematic subsamples of at least 100 students’ responses to each item coded independently by two readers. Exhibit A.7 shows the average and range of the within-country percent of exact agreement between scorers on the free-response items in the science test for 37 of the 38 countries. A high percentage of exact agreement was observed, with an overall average of 95 percent across the 37 countries. The TIMSS data from the reliability studies indicate that scoring procedures were robust for the science items, especially for the correctness score used for the analyses in this report. In the United States, the average percent exact agreement was 94 percent for the correctness score and 89 percent for the diagnostic score. Since the Benchmarking data were combined with the U.S. national TIMSS sample for scoring purposes, this high level of scoring reliability applies to the Benchmarking data also.
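As a rough illustration of how percent exact agreement between two readers could be computed, the sketch below compares two lists of two-digit codes. It assumes that agreement on the correctness score means agreement on the first digit and that agreement on the diagnostic score means agreement on the full two-digit code. The code values and function names are hypothetical, not actual rubric codes or study software.

```python
def percent_exact_agreement(codes_a, codes_b):
    """Percentage of responses to which two readers gave identical codes."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both readers must code the same non-empty set of responses")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)


def correctness_digit(code):
    """First digit of a two-digit code (the correctness level)."""
    return code // 10


# Hypothetical two-digit codes assigned by two independent readers
reader_1 = [20, 21, 10, 70, 10]
reader_2 = [20, 20, 10, 70, 11]

# Agreement on the full two-digit diagnostic code
diagnostic_agreement = percent_exact_agreement(reader_1, reader_2)

# Agreement on the first (correctness) digit only
correctness_agreement = percent_exact_agreement(
    [correctness_digit(c) for c in reader_1],
    [correctness_digit(c) for c in reader_2],
)
print(diagnostic_agreement, correctness_agreement)
```

Because the correctness digit is coarser than the full code, agreement on it can never be lower than agreement on the diagnostic code, which is consistent with the pattern reported for the United States.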

Test Reliability

Exhibit A.8 displays the science test reliability coefficient for each country and Benchmarking participant. This coefficient is the median KR-20 reliability across the eight test booklets. Among countries, median reliabilities ranged from 0.62 in Morocco to 0.86 in Singapore. The international median, 0.80, is the median of the reliability coefficients for all countries. Reliability coefficients among Benchmarking participants were generally close to the international median, ranging from 0.82 to 0.86 across states, and from 0.77 to 0.85 across districts and consortia.
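For readers unfamiliar with the coefficient, the following is a minimal sketch of the standard KR-20 formula for dichotomously scored items, followed by the median across booklets. It assumes the population variance of total scores and ignores how the study handled items worth two points; the response data and booklet values are invented for illustration.

```python
from statistics import median, pvariance


def kr20(responses):
    """KR-20 reliability for rows of 0/1 item scores (one row per student)."""
    n_students = len(responses)
    n_items = len(responses[0])
    # Variance of students' total scores (population form, an assumption)
    total_variance = pvariance([sum(row) for row in responses])
    # Sum over items of p * (1 - p), where p is the proportion correct
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_students
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_variance)


# Invented responses: four students, three items
example = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
booklet_reliability = kr20(example)

# Reported coefficient: median across booklets (hypothetical values)
reported = median([0.78, 0.80, 0.81, 0.79, 0.82, 0.80, 0.77, 0.83])
print(booklet_reliability, reported)
```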

Data Processing

To ensure the availability of comparable, high-quality data for analysis, TIMSS took rigorous quality control steps to create the international database.(9) TIMSS prepared manuals and software for countries to use in entering their data, so that the information would be in a standardized international format before being forwarded to the IEA Data Processing Center in Hamburg for creation of the international database. Upon arrival at the Data Processing Center, the data underwent an exhaustive cleaning process. This involved several iterative steps and procedures designed to identify, document, and correct deviations from the international instruments, file structures, and coding schemes. The process also emphasized consistency of information within national data sets and appropriate linking among the many student, teacher, and school data files. In the United States, the creation of the data files for both the Benchmarking participants and the U.S. national TIMSS effort was the responsibility of Westat, working closely with NCS. After the data files were checked carefully by Westat, they were sent to the IEA Data Processing Center, where they underwent further validity checks before being forwarded to the International Study Center.

IRT Scaling and Data Analysis

The general approach to reporting the TIMSS achievement data was based primarily on item response theory (IRT) scaling methods.(10) The science results were summarized using a family of 2-parameter and 3-parameter IRT models for dichotomously-scored items (right or wrong), and generalized partial credit models for items with 0, 1, or 2 available score points. The IRT scaling method produces a score by averaging the responses of each student to the items that he or she took in a way that takes into account the difficulty and discriminating power of each item. The methodology used in TIMSS includes refinements that enable reliable scores to be produced even though individual students responded to relatively small subsets of the total science item pool. Achievement scales were produced for each of the six science content areas (earth science, life science, physics, chemistry, environmental and resource issues, and scientific inquiry and the nature of science), as well as for science overall.
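To make the role of item difficulty and discriminating power concrete, here is a minimal sketch of the textbook 3-parameter logistic item response function; setting the guessing parameter to zero gives the 2-parameter form. The scaling constant 1.7 and all parameter values are conventional or illustrative assumptions, and this is not the scaling software used in the study.

```python
import math


def prob_correct_3pl(theta, a, b, c=0.0, scaling=1.7):
    """Probability of a correct response under a 3-parameter logistic model.

    theta: student proficiency
    a: item discrimination
    b: item difficulty
    c: lower asymptote (guessing); c = 0 gives the 2-parameter model
    scaling: conventional constant, assumed here to be 1.7
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scaling * a * (theta - b)))


# A student whose proficiency equals the item difficulty
p_2pl = prob_correct_3pl(theta=0.5, a=1.2, b=0.5)           # 2-parameter case
p_3pl = prob_correct_3pl(theta=0.5, a=1.2, b=0.5, c=0.2)    # with guessing
print(p_2pl, p_3pl)
```

When proficiency equals difficulty, the probability is one half in the 2-parameter case and halfway between the guessing level and one in the 3-parameter case, which is why responses to harder or more discriminating items carry different information about a student.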

The IRT methodology was preferred for developing comparable estimates of performance for all students, since students answered different test items depending upon which of the eight test booklets they received. The IRT analysis provides a common scale on which performance can be compared across countries. In addition to providing a basis for estimating mean achievement, scale scores permit estimates of how students within countries vary and provide information on percentiles of performance. To provide a reliable measure of student achievement in both 1999 and 1995, the overall science scale was calibrated using students from the countries that participated in both years. When all countries participating in 1995 at the eighth grade are treated equally, the TIMSS scale average over those countries is 500 and the standard deviation is 100. Since the countries varied in size, each country was weighted to contribute equally to the mean and standard deviation of the scale. The average and standard deviation of the scale scores are arbitrary and do not affect scale interpretation. When the metric of the scale had been established, students from the countries that tested in 1999 but not 1995 were assigned scores on the basis of the new scale. IRT scales were also created for each of the six science content areas for the 1999 data. Students from the Benchmarking samples were assigned scores on the overall science scale as well as in each of the six science content areas using the same item parameters and estimation procedures as for TIMSS internationally.

To allow more accurate estimation of summary statistics for student subpopulations, the TIMSS scaling made use of plausible-value technology, whereby five separate estimates of each student’s score were generated on each scale, based on the student’s responses to the items in the student’s booklet and the student’s background characteristics. The five score estimates are known as “plausible values,” and the variability between them encapsulates the uncertainty inherent in the score estimation process.

Estimating Sampling Error

Because the statistics presented in this report are estimates of performance based on samples of students, rather than the values that could be calculated if every student in every country or Benchmarking jurisdiction had answered every question, it is important to have measures of the degree of uncertainty of the estimates. The jackknife procedure was used to estimate the standard error associated with each statistic presented in this report.(11) The jackknife standard errors also include an error component due to variation between the five plausible values generated for each student. The use of confidence intervals, based on the standard errors, provides a way to make inferences about the population means and proportions in a manner that reflects the uncertainty associated with the sample estimates. An estimated sample statistic plus or minus two standard errors represents a 95 percent confidence interval for the corresponding population result.
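The sketch below illustrates one common way to combine a sampling variance with the variation between plausible values, and then to form the interval of plus or minus two standard errors described above. The combination rule (sampling variance plus (1 + 1/M) times the variance of the M plausible-value estimates) is an assumption borrowed from standard multiple-imputation practice; the text does not state the exact formula, and all numbers are invented.

```python
import math
from statistics import mean, variance


def combined_standard_error(pv_estimates, sampling_variance):
    """Standard error including a between-plausible-value component.

    pv_estimates: the statistic computed once per plausible value
    sampling_variance: variance from the jackknife procedure (given here)
    """
    m = len(pv_estimates)
    imputation_variance = variance(pv_estimates)  # divides by m - 1
    return math.sqrt(sampling_variance + (1 + 1 / m) * imputation_variance)


def confidence_interval(estimate, standard_error, multiplier=2.0):
    """Estimate plus or minus two standard errors (about 95 percent)."""
    return (estimate - multiplier * standard_error,
            estimate + multiplier * standard_error)


# Invented mean scores, one per plausible value, and an invented sampling variance
pv_means = [515, 517, 516, 514, 518]
estimate = mean(pv_means)
se = combined_standard_error(pv_means, sampling_variance=13.0)
low, high = confidence_interval(estimate, se)
print(estimate, se, low, high)
```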

Making Multiple Comparisons

This report makes extensive use of statistical hypothesis-testing to provide a basis for evaluating the significance of differences in percentages and in average achievement scores. Each separate test follows the usual convention of holding to 0.05 the probability that reported differences could be due to sampling variability alone. However, in exhibits where statistical significance tests are reported, the results of many tests are reported simultaneously, usually at least one for each country and Benchmarking participant in the exhibit. The significance tests in these exhibits are based on a Bonferroni procedure for multiple comparisons that holds to 0.05 the probability of erroneously declaring a statistic (mean or percentage) for one entity to be different from that for another entity. In the multiple comparison charts (Exhibit 1.2 and those in Appendix B), the Bonferroni procedure adjusts for the number of entities in the chart, minus one. In exhibits where a country or Benchmarking participant statistic is compared to the international average, the adjustment is for the number of entities.(12)
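The Bonferroni adjustment described above amounts to dividing the overall significance level by the number of comparisons. A minimal sketch follows; the function name and the entity counts in the example are hypothetical.

```python
def bonferroni_alpha(n_entities, family_alpha=0.05, compare_to_average=False):
    """Per-comparison significance level under a Bonferroni adjustment.

    In a multiple comparison chart each entity is compared with every
    other entity, so the adjustment is for n_entities - 1 comparisons.
    When each entity is compared with the international average, the
    adjustment is for n_entities comparisons.
    """
    n_comparisons = n_entities if compare_to_average else n_entities - 1
    return family_alpha / n_comparisons


# Hypothetical chart with 11 entities: 10 comparisons per entity
chart_alpha = bonferroni_alpha(11)

# Hypothetical exhibit comparing 10 entities with the international average
average_alpha = bonferroni_alpha(10, compare_to_average=True)
print(chart_alpha, average_alpha)
```

The more entities an exhibit contains, the stricter each individual test becomes, so a difference that is significant in a small exhibit may not be flagged in a larger one.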

Setting International Benchmarks of Student Achievement

International benchmarks of student achievement were computed at each grade level for both mathematics and science. The benchmarks are points in the weighted international distribution of achievement scores that separate the top 10 percent of students, the top 25 percent, the top 50 percent, and the bottom 25 percent. The percentage of students in each country and Benchmarking jurisdiction meeting or exceeding the international benchmarks is reported. The benchmarks correspond to the 90th, 75th, 50th, and 25th percentiles of the international distribution of achievement. When computing these percentiles, each country contributed as many students to the distribution as there were students in the target population in the country. That is, each country’s contribution to setting the international benchmarks was proportional to the estimated population enrolled at the eighth grade.
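A weighted percentile of this kind can be sketched as follows. The example uses the simplest definition (the lowest score at which the cumulative weight reaches the requested share, with no interpolation), which is an assumption rather than the documented procedure; the scores and weights are invented, with each weight standing for the number of students in the population that a sampled student represents.

```python
def weighted_percentile(scores, weights, percentile):
    """Lowest score at which cumulative weight reaches the given percentile."""
    pairs = sorted(zip(scores, weights))
    threshold = sum(weights) * percentile / 100.0
    cumulative = 0.0
    for score, weight in pairs:
        cumulative += weight
        if cumulative >= threshold:
            return score
    return pairs[-1][0]


def percent_at_or_above(scores, weights, benchmark):
    """Weighted percentage of students meeting or exceeding a benchmark."""
    reached = sum(w for s, w in zip(scores, weights) if s >= benchmark)
    return 100.0 * reached / sum(weights)


# Invented scores and population weights
scores = [400, 500, 600, 700]
weights = [1, 1, 1, 1]

benchmarks = {p: weighted_percentile(scores, weights, p) for p in (90, 75, 50, 25)}
share_at_median = percent_at_or_above(scores, weights, benchmarks[50])
print(benchmarks, share_at_median)
```

Weighting by population size means a large country moves the benchmark points more than a small one, in contrast to the scale metric, where each country contributed equally.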

In order to interpret the TIMSS scale scores and analyze achievement at the international benchmarks, TIMSS conducted a scale anchoring analysis to describe achievement of students at those four points on the scale. Scale anchoring is a way of describing students’ performance at different points on a scale in terms of what they know and can do. It involves a statistical component, in which items that discriminate between successive points on the scale are identified, and a judgmental component in which subject-matter experts examine the items and generalize to students’ knowledge and understandings.(13)

Science Curriculum Questionnaire

In an effort to collect information about the content of the intended curriculum in science, TIMSS asked National Research Coordinators and Coordinators from the Benchmarking jurisdictions to complete a questionnaire about the structure, organization, and content coverage of their curricula. Coordinators reviewed 42 science topics and reported the percentage of their eighth-grade students for which each topic was intended in their curriculum. Although most topic descriptions were used without modification, there were occasions when Coordinators found it necessary to expand on or qualify the topic description to describe their situation accurately. The country-specific adaptations to the science curriculum questionnaire are presented in Exhibit A.9. No adaptations to the list of topics were necessary for the U.S. national version. Among Benchmarking participants, seven of the states and none of the districts or consortia made adaptations, and these are shown in Exhibit A.10.

Footnotes

1	The 魅影直播 1999 results for mathematics and science, respectively, are reported in Mullis, I.V.S., Martin, M.O., Gonzalez, E.J., Gregory, K.D., Garden, R.A., O’Connor, K.M., Chrostowski, S.J., and Smith, T.A. (2000), 魅影直播 1999 International Mathematics Report: Findings from IEA’s Repeat of the Third International Mathematics and Science Study at the Eighth Grade, Chestnut Hill, MA: Boston College, and in Martin, M.O., Mullis, I.V.S., Gonzalez, E.J., Gregory, K.D., Smith, T.A., Chrostowski, S.J., Garden, R.A., and O’Connor, K.M. (2000), 魅影直播 1999 International Science Report: Findings from IEA’s Repeat of the Third International Mathematics and Science Study at the Eighth Grade, Chestnut Hill, MA: Boston College.
2	The complete 魅影直播 curriculum frameworks can be found in Robitaille, D.F., et al. (1993), 魅影直播 Monograph No.1: Curriculum Frameworks for Mathematics and Science, Vancouver, BC: Pacific Educational Press.
3	For a full discussion of the 魅影直播 1999 test development effort, please see Garden, R.A. and Smith, T.A. (2000), “魅影直播 Test Development” in M.O. Martin, K.D. Gregory, K.M. O’Connor, and S.E. Stemler (eds.), 魅影直播 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
4	The 1999 魅影直播 test design is identical to the design for 1995, which is fully documented in Adams, R. and Gonzalez, E. (1996), “魅影直播 Test Design” in M.O. Martin and D.L. Kelly (eds.), Third International Mathematics and Science Study Technical Report, Volume I, Chestnut Hill, MA: Boston College.
5	More details about the translation verification procedures can be found in O’Connor, K., and Malak, B. (2000), “Translation and Cultural Adaptation of the 魅影直播 Instruments” in M.O. Martin, K.D. Gregory, K.M. O’Connor, and S.E. Stemler (eds.), 魅影直播 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
6	The sample design for 魅影直播 is described in detail in Foy, P., and Joncas, M. (2000), “魅影直播 Sample Design” in M.O. Martin, K.D. Gregory, and S.E. Stemler (eds.), 魅影直播 1999 Technical Report, Chestnut Hill, MA: Boston College. Sampling for the Benchmarking project is described in Fowler, J., Rizzo, L., and Rust, K. (2001), “魅影直播 Benchmarking Sampling Design and Implementation” in M.O. Martin, K.D. Gregory, K.M. O’Connor, and S.E. Stemler (eds.), 魅影直播 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
7	Steps taken to ensure high-quality data collection in 魅影直播 internationally are described in detail in O’Connor, K., and Stemler, S. (2000), “Quality Control in the 魅影直播 Data Collection” in M.O. Martin, K.D. Gregory and S.E. Stemler (eds.), 魅影直播 1999 Technical Report, Chestnut Hill, MA: Boston College.
8	Quality control measures for the Benchmarking project are described in O’Connor, K. and Stemler, S. (2001), “Quality Control in the 魅影直播 Benchmarking Data Collection” in M.O. Martin, K.D. Gregory, K.M. O’Connor, and S.E. Stemler (eds.), 魅影直播 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
9	These steps are detailed in Hastedt, D., and Gonzalez, E. (2000), “Data Management and Database Construction” in M.O. Martin, K.D. Gregory, K.M. O’Connor, and S.E. Stemler (eds.), 魅影直播 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
10	For a detailed description of the 魅影直播 scaling, see Yamamoto, K., and Kulick, E. (2000), “Scaling Methods and Procedures for the 魅影直播 Mathematics and Science Scales” in M.O. Martin, K.D. Gregory, K.M. O’Connor, and S.E. Stemler (eds.), 魅影直播 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
11	Procedures for computing jackknifed standard errors are presented in Gonzalez, E. and Foy, P. (2000), “Estimation of Sampling Variance” in M.O. Martin, K.D. Gregory, K.M. O’Connor, and S.E. Stemler (eds.), 魅影直播 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
12	The application of the Bonferroni procedures is described in Gonzalez, E., and Gregory, K. (2000), “Reporting Student Achievement in Mathematics and Science” in M.O. Martin, K.D. Gregory, K.M. O’Connor, and S.E. Stemler (eds.), 魅影直播 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College.
13	The scale anchoring procedure is described fully in Gregory, K., and Mullis, I. (2000), “Describing International Benchmarks of Student Achievement” in M.O. Martin, K.D. Gregory, K.M. O’Connor, and S.E. Stemler (eds.), 魅影直播 1999 Benchmarking Technical Report, Chestnut Hill, MA: Boston College. An application of the procedure to the 1995 魅影直播 data may be found in Smith, T.A., Martin, M.O., Mullis, I.V.S., and Kelly, D.L. (2000), Profiles of Student Achievement in Science at the 魅影直播 International Benchmarks: U.S. Performance and Standards in an International Context, Chestnut Hill, MA: Boston College.

魅影直播 1999 is a project of the International Study Center
Boston College, Lynch School of Education